Lesson: Identifying Duplicate Records 


Speed — Optimal processing time is achieved with many small break groups; however, 
valid matches may not be identified if break groups are too small. 


Match quality — Optimal match quality is achieved with fewer and larger break groups; 
however, larger break groups require more comparisons and processing time. 


Controlling the number of record comparisons in the matching process is important for 
performance. Break groups limit the number of comparisons performed during the matching 
process, because matching is only considered within break groups, not between them. Break 
groups are established by defining criteria called a break key. 
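
To see why break groups reduce the workload, a rough count helps. The sketch below assumes that every record in a group is compared once with every other record in the same group, so a group of n records costs n(n-1)/2 comparisons. The function names and group sizes are illustration values only, not figures from this lesson.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def comparisons_with_break_groups(group_sizes: list[int]) -> int:
    """Total comparisons when records are compared only within their own break group."""
    return sum(pair_count(size) for size in group_sizes)


# Hypothetical data set: 10,000 records split evenly into 100 break groups.
group_sizes = [100] * 100
without_groups = pair_count(sum(group_sizes))
with_groups = comparisons_with_break_groups(group_sizes)

print(without_groups)  # 49995000
print(with_groups)     # 495000
```

Under these assumptions, the break groups cut the number of comparisons by a factor of about 100.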


Defining an appropriate break key can save valuable processing time by preventing widely 
divergent data from being compared. Break keys should group records that would most likely 
contain matches. Fields commonly used for creating break groups are postcodes, account or 
identification numbers, or the first two positions of a street name. 
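
As a minimal sketch of what deriving a break key might look like, the functions below build a key from a single record. The record layout (a dictionary with `postcode` and `street` fields) and the function names are assumptions made for illustration; they are not part of any product interface.

```python
def postcode_break_key(record: dict, length: int = 3) -> str:
    """Break key from the leading characters of the postcode (spaces removed, upper-cased)."""
    return record["postcode"].replace(" ", "").upper()[:length]


def street_break_key(record: dict) -> str:
    """Break key from the first two positions of the street name."""
    return record["street"].strip().upper()[:2]


# Hypothetical record used only to show the two keys.
record = {"postcode": "80903", "street": "Main Street"}
print(postcode_break_key(record))  # 809
print(street_break_key(record))    # MA
```

A longer key, such as `postcode_break_key(record, 5)`, produces smaller break groups and fewer comparisons, but two records for the same customer with slightly different postcodes would then never be compared.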


For example, when matching on address data, it is common to use the first three digits of a 
postcode as the break key. Thus, only records that have the same first three digits of a 
postcode become members of a break group. In the following figure, with a break key of the 
first three digits of the postal code, records in A1 would be compared to records in A2, but 
never to records in B1 or B2. 





Figure 27: Setting Up Break Keys 


Once you see how break keys control the number of records that are matched, it is easier to 
follow the match process. 


Creating Break Keys 
• All records 
• Each record belongs to a postal code 
• Set the break key on the first three digits of the postal code 
• Records that contain 809 as the first three digits form break group A 
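
The steps above can be sketched in a few lines of code. This is an illustration under stated assumptions, not the matching engine's actual implementation: the record fields (`id`, `postcode`), the sample postcodes, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_break_groups(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by the first three digits of the postal code (the break key)."""
    groups = defaultdict(list)
    for record in records:
        groups[record["postcode"][:3]].append(record)
    return dict(groups)


def candidate_pairs(groups: dict[str, list[dict]]) -> list[tuple[str, str]]:
    """List the record pairs to compare: only pairs inside the same break group."""
    pairs = []
    for members in groups.values():
        for left, right in combinations(members, 2):
            pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical records: two share the prefix 809, two share the prefix 546.
records = [
    {"id": "r1", "postcode": "80903"},
    {"id": "r2", "postcode": "80918"},
    {"id": "r3", "postcode": "54601"},
    {"id": "r4", "postcode": "54650"},
]
groups = build_break_groups(records)
print(sorted(groups))           # ['546', '809']
print(candidate_pairs(groups))  # [('r1', 'r2'), ('r3', 'r4')]
```

Records r1 and r2 form the 809 break group and are compared with each other, but never with r3 or r4.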